fix(operator): un-gate readiness on needs-reindex-all, REINDEX DATABASE CONCURRENTLY by passcod · Pull Request #69 · beyondessential/postgres-restore-operator

passcod · 2026-06-10T03:51:25Z

🤖

Summary

Two changes that together unblock nauru-prod (and any replica whose snapshot triggers pg_resetwal):

Drop the amcheck-driven smart pass for needs-reindex-all; replace it with a blind REINDEX DATABASE per user DB — CONCURRENTLY on PG ≥ 12 so clients can keep using the old indexes during the rebuild.
Drop the readiness probe's gate on /pgdata/needs-reindex-all. Keep the gate on the locale-only /pgdata/needs-reindex (small, fast, proven).

Why

Verified live on nauru-prod after #68 deployed. Every restore attempt failed at the operator's 30-minute deployment_ready_timeout. Tracing inside a fresh pod:

/pgdata/needs-reindex-all flag present (pg_resetwal did fire — expected on this source).
pg_stat_activity shows the background reindex hook running.
bt_index_check('pg_type_oid_index'::regclass) (a tiny system catalog index, normally a few KB) — state active, empty wait_event, query_start growing linearly with wall clock. 4+ minutes on that one index with no visible progress.

Same shape as the CREATE TABLE spinloop documented earlier on this dataset: vanilla DDL pegs 100% CPU forever inside postgres with no visible reason. The smart pass via amcheck can't avoid the hammer — it IS a hammer that triggers the same wedge.

REINDEX DATABASE CONCURRENTLY reads the heap (not the index) to rebuild from scratch — a completely different code path that isn't subject to whatever amcheck is hitting in the corrupt index pages. PG ≥ 12 syntax; falls back to blocking REINDEX DATABASE on older.

But REINDEX DATABASE on the prod-sized DBs takes hours. Gating readiness on it makes deployment_ready_timeout (30 min default) trip every time and the restore fails. So the second change is necessary: let the pod become Ready as soon as postgres accepts connections; let the reindex run in background. CONCURRENTLY means the corrupt indexes stay queryable until the atomic swap, narrowing the window where clients can see corruption.

Caveats

REINDEX DATABASE CONCURRENTLY skips system catalogs (PG won't CONCURRENTLY them). For an analytics replica that's the right trade — user-data indexes are what client queries hit; system catalog corruption surfaces as different errors and is rare.

Trade-off

Clients hitting a not-yet-reindexed corrupt index get the explicit unexpected zero page at block N error from postgres — same error the downstream users originally reported. With CONCURRENTLY the window is narrow (queries hit the old index until the atomic swap); they retry, succeed once the rebuild lands.

Strictly better than the alternative (permanently-failed restore, replica stuck indefinitely on the previous Active).

Shape

builders.rs: the needs-reindex-all branch in the postgres container startup hook replaces the amcheck loop with REINDEX DATABASE CONCURRENTLY "$db" per user database (with a PG < 12 fallback). Comment explains the wedge and why amcheck is wrong here.
builders.rs: readiness probe drops the needs-reindex-all check. Comment explains.
Tests: updated assertions — runtime hook uses REINDEX DATABASE CONCURRENTLY, gates on PG_MAJOR ≥ 12, no bt_index_check; readiness probe gates only on the locale-only flag.

Recovery

For replicas already in the failure loop (pre-fix gate), the readiness probe will start returning success on its next check after the operator rolls the new image (the needs-reindex-all file may still be there, but the probe no longer cares). The background reindex keeps churning. For the current live nauru-prod pod, I removed the flag by hand to unblock immediately — the next restore will exercise the new code path.

…SE CONCURRENTLY The amcheck-driven smart pass introduced in #68 hits the same family of postgres-internal pathology that wedges other vanilla DDL on the prod dataset — bt_index_check itself burns 100% CPU forever on specific indexes (observed via pg_stat_activity: state=active, empty wait_event, query_start growing linearly with wall clock for 4+ minutes on a tiny system catalog index). The smart pass is unusable on this data. Two changes: 1. Drop bt_index_check; revert the needs-reindex-all branch to a blind REINDEX DATABASE per user DB, CONCURRENTLY on PG ≥ 12 so clients can keep using the old indexes during the rebuild and the atomic swap makes corruption disappear without downtime. REINDEX reads the heap and rebuilds the index from scratch — a different code path from amcheck (which reads corrupt index pages directly) — so it isn't subject to the same wedge. Slow on prod-sized DBs (hours) but makes progress. 2. Drop the readiness probe's gate on /pgdata/needs-reindex-all. With the reindex taking hours, gating readiness here trips the operator's 30-minute deployment_ready_timeout and the restore is marked Failed before postgres even has a chance to come up. The probe still gates on /pgdata/needs-reindex (locale-only, finishes in seconds) since that's small and proven. Trade-off: clients hitting a not-yet-reindexed corrupt index in the window between pod-Ready and reindex-complete get the explicit "unexpected zero page" error from postgres. With CONCURRENTLY the window is narrow (queries hit the old index until the atomic swap) and clients can retry. Strictly better than the alternative (permanently failed restore, replica stuck indefinitely on the previous Active). The pre-#68 behaviour for the locale-only path is unchanged.

passcod force-pushed the operator-reindex-no-readiness-gate branch from 1f898f7 to 6987f64 Compare June 10, 2026 03:53

passcod changed the title ~~fix(operator): un-gate readiness on needs-reindex-all, revert to blind REINDEX DATABASE~~ fix(operator): un-gate readiness on needs-reindex-all, REINDEX DATABASE CONCURRENTLY Jun 10, 2026

passcod enabled auto-merge June 10, 2026 03:59

passcod merged commit e9e0b9f into main Jun 10, 2026
18 checks passed

passcod deleted the operator-reindex-no-readiness-gate branch June 10, 2026 04:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(operator): un-gate readiness on needs-reindex-all, REINDEX DATABASE CONCURRENTLY#69

fix(operator): un-gate readiness on needs-reindex-all, REINDEX DATABASE CONCURRENTLY#69
passcod merged 1 commit into
mainfrom
operator-reindex-no-readiness-gate

passcod commented Jun 10, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

passcod commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why

Caveats

Trade-off

Shape

Recovery

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

passcod commented Jun 10, 2026 •

edited

Loading